{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1SmE2CODfmmL"
   },
   "source": [
    "# Ungraded Lab: Generating Sequences and Padding\n",
    "\n",
    "In this lab, you will look at converting input sentences into numeric sequences. Similar to images in the previous course, you need to prepare text data with uniform size before feeding it to your model. You will see how to do these in the next sections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JiFUJg-lmTm6"
   },
   "source": [
    "## Text to Sequences\n",
    "\n",
    "In the previous lab, you saw how to use the `TextVectorization` layer to build a vocabulary from your corpus. It generates a list where more frequent words have lower indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LXzsIYWMvFM-"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "# Sample inputs\n",
    "sentences = [\n",
    "    'I love my dog',\n",
    "    'I love my cat',\n",
    "    'You love my dog!',\n",
    "    'Do you think my dog is amazing?'\n",
    "    ]\n",
    "\n",
    "# Initialize the layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization()\n",
    "\n",
    "# Compute the vocabulary\n",
    "vectorize_layer.adapt(sentences)\n",
    "\n",
    "# Get the vocabulary\n",
    "vocabulary = vectorize_layer.get_vocabulary()\n",
    "\n",
    "# Print the token index\n",
    "for index, word in enumerate(vocabulary):\n",
    "  print(index, word)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0VNFxYidr9qg"
   },
   "source": [
    "You can then use the result to convert each of the input sentences into integer sequences. See how that's done below given a single input string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "lQWcXlE1saUS"
   },
   "outputs": [],
   "source": [
    "# String input\n",
    "sample_input = 'I love my dog'\n",
    "\n",
    "# Convert the string input to an integer sequence\n",
    "sequence = vectorize_layer(sample_input)\n",
    "\n",
    "# Print the result\n",
    "print(sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6ZZnENfZtoiA"
   },
   "source": [
    "As shown, you simply pass in the string to the layer which already learned the vocabulary, and it will output the integer sequence as a `tf.Tensor`. In this case, the result is `[6 3 2 4]`. You can look at the token index printed above to verify that it matches the indices for each word in the input string.\n",
    "\n",
    "For a given list of string inputs (such as the 4-item `sentences` list above), you will need to apply the layer to each input. There's more than one way to do this. Let's first use the `map()` method and see the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "41foIDBQw3FA"
   },
   "outputs": [],
   "source": [
    "# Convert the list to tf.data.Dataset\n",
    "sentences_dataset = tf.data.Dataset.from_tensor_slices(sentences)\n",
    "\n",
    "# Define a mapping function to convert each sample input\n",
    "sequences = sentences_dataset.map(vectorize_layer)\n",
    "\n",
    "# Print the integer sequences\n",
    "for sentence, sequence in zip(sentences, sequences):\n",
    "  print(f'{sentence} ---> {sequence}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yV_91IQB62R0"
   },
   "source": [
    "As you can see, each sentence is successfully transformed into an integer sequence. The problem with this is they have varying lengths so it cannot be consumed by the model right away."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z56pEkF2p8c-"
   },
   "source": [
    "## Padding\n",
    "\n",
    "You can get a list of varying lengths to have a uniform size by padding or truncating tokens from the sequences. Padding is more common to preserve information.\n",
    "\n",
    "Recall that your vocabulary reserves a special token index `0` for padding. It will add that token (called post padding) if you pass in a list of string inputs to the layer. See an example below. Notice that you have the same output as above but the integer sequences are already post-padded with `0` up to the length of the longest sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DJpjZvG9wtLP"
   },
   "outputs": [],
   "source": [
    "# Apply the layer to the string input list\n",
    "sequences_post = vectorize_layer(sentences)\n",
    "\n",
    "# Print the results\n",
    "print('INPUT:')\n",
    "print(sentences)\n",
    "print()\n",
    "\n",
    "print('OUTPUT:')\n",
    "print(sequences_post)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EHqYAVmNAi5D"
   },
   "source": [
    "If you want pre-padding, you can use the [pad_sequences()](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences) utility to prepend a padding token to the sequences. Notice that the `padding` argument is set to `pre`. This is just for clarity. The function already has this set as the default so you can opt to drop it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qljgx1eSlEse"
   },
   "outputs": [],
   "source": [
    "# Pre-pad the sequences to a uniform length.\n",
    "# You can remove the `padding` argument and get the same result.\n",
    "sequences_pre = tf.keras.utils.pad_sequences(sequences, padding='pre')\n",
    "\n",
    "# Print the results\n",
    "print('INPUT:')\n",
    "[print(sequence.numpy()) for sequence in sequences]\n",
    "print()\n",
    "\n",
    "print('OUTPUT:')\n",
    "print(sequences_pre)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GsgHsYP2DDrQ"
   },
   "source": [
    "If you switch the `padding` argument to `post`, you will arrive at the same result as applying the layer directly.\n",
    "\n",
    "The function also has a `maxlen` argument that you can use to truncate tokens from the sequences. By default, it will drop tokens in front. If you want to drop tokens at the other end, you will have to set the `truncating` argument to `post`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "70dmSqBDAbYH"
   },
   "outputs": [],
   "source": [
    "# Post-pad the sequences and limit the size to 5.\n",
    "sequences_post_trunc = tf.keras.utils.pad_sequences(sequences, maxlen=5, padding='pre')\n",
    "\n",
    "# Print the results\n",
    "print('INPUT:')\n",
    "[print(sequence.numpy()) for sequence in sequences]\n",
    "print()\n",
    "\n",
    "print('OUTPUT:')\n",
    "print(sequences_post_trunc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to prepare your sequences for prepadding is to set the TextVectorization to output a ragged tensor. This means the output will not be automatically post-padded. See the output sequences here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the layer to output a ragged tensor\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)\n",
    "\n",
    "# Compute the vocabulary\n",
    "vectorize_layer.adapt(sentences)\n",
    "\n",
    "# Apply the layer to the sentences\n",
    "ragged_sequences = vectorize_layer(sentences)\n",
    "\n",
    "# Print the results\n",
    "print(ragged_sequences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With that, you can now pass it directly to the `pad_sequences()` utility."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pre-pad the sequences in the ragged tensor\n",
    "sequences_pre = tf.keras.utils.pad_sequences(sequences)\n",
    "\n",
    "# Print the results\n",
    "print(sequences_pre)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "btEb9jI0k7Ip"
   },
   "source": [
    "## Out-of-vocabulary tokens\n",
    "\n",
    "Lastly, you'll see what the other special token is for. The layer will use the token index `1` when you have input words that are not found in the vocabulary list. For example, you may decide to collect more text after your initial training and decide to not recompute the vocabulary. You will see this in action in the cell below. Notice that the token `1` is inserted for words that are not found in the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4fW1NWTok72V"
   },
   "outputs": [],
   "source": [
    "# Try with words that are not in the vocabulary\n",
    "sentences_with_oov = [\n",
    "    'i really love my dog',\n",
    "    'my dog loves my manatee'\n",
    "]\n",
    "\n",
    "# Generate the sequences\n",
    "sequences_with_oov = vectorize_layer(sentences_with_oov)\n",
    "\n",
    "# Print the integer sequences\n",
    "for sentence, sequence in zip(sentences_with_oov, sequences_with_oov):\n",
    "  print(f'{sentence} ---> {sequence}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UBlQIPBqskAJ"
   },
   "source": [
    "This concludes another introduction to text data preprocessing. So far, you've just been using dummy data. In the next exercise, you will be applying the same concepts to a real-world and much larger dataset."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
